In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches: models attaining sub-quadratic attention complexity can exploit a notion of sparsity or a low-rank approximation of the inputs to reduce the number of attended keys; other ways to reduce complexity include locality-sensitive hashing, key pooling, additional memory to store information in compacted form, or hybridization with other architectures, such as CNNs. Often grounded in a strong mathematical foundation, kernelized approaches allow attention to be approximated with linear complexity while retaining high accuracy. In the present paper, we therefore aim to extend the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
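To make the linear-complexity idea concrete, the sketch below illustrates one well-known kernelized formulation of attention (in the spirit of linear transformers), not the specific method proposed here: the softmax kernel is replaced by a feature map phi, so attention can be evaluated as phi(Q)(phi(K)^T V) with a row-wise normalizer, in time linear in the sequence length. The function names and the ELU+1 feature map are illustrative assumptions.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: a common positive feature map used in kernelized attention
    # (an assumption here; trainable feature maps are also possible)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Approximates softmax(Q K^T) V by phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1),
    # which costs O(n * d^2) instead of O(n^2 * d) for sequence length n.
    Qp, Kp = feature_map(Q), feature_map(K)        # (n, d)
    KV = Kp.T @ V                                  # (d, d_v): keys/values summarized once
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T       # (n, 1): per-query normalizer
    return (Qp @ KV) / (Z + 1e-6)

# Toy usage: sequence length 1024, head dimension 64
n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)                    # shape (n, 64)
```

Because the key-value summary `KV` is computed once and reused for every query, the quadratic attention matrix is never materialized; this is the property that trainable kernel methods aim to retain while improving the quality of the approximation.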
In this paper, we argue that the dot-product pairwise matching layer widely used in Transformer-based models is redundant for model performance. Attention in its original formulation should instead be regarded as a human-level tool for exploring and/or visualizing relevance scores within a sequence. In its place, we present a simple and fast alternative that involves no approximation and that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long Range Arena benchmark.